View AN1381_200643.PDF datasheet online --- IC-ON-LINE

Datasheet File OCR Text:

AN1381 APPLICATION NOTE
Implementing the Radix-4 FFT Algorithm Using the ST120 DSP
By Marianne DELPHIN
INTRODUCTION This application note for the 'Radix-4' implementation of the Fast Fourier Transform algorithm on the ST120 DSP shows how this algorithm well exploits the high parallelism of the 'ST100' superscalar architecture.
October 2001
1/22
AN1381 - APPLICATION NOTE
TABLE OF CONTENTS 1 1.1 1.2 2 2.1 2.2 2.3 2.4 2.4.1 2.4.2 2.4.3 2.4.4 2.4.5 2.4.5.1 2.4.5.2 2.4.6 3 3.1 3.2 4 FFT PRINCIPLES ...................................................................................................... FAST FOURIER TRANSFORM ................................................................................. RADIX-4 FFT ALGORITHM ....................................................................................... RADIX-4 FFT IMPLEMENTATION ON ST120 .......................................................... DATA ORGANIZATION .............................................................................................. AVOIDING DIGIT REVERSING ................................................................................. C CODE IMPLEMENTATION ..................................................................................... ASSEMBLY IMPLEMENTATION ............................................................................... Packed Arithmetic ....................................................................................................... Software Pipelining ..................................................................................................... Reload Counters ......................................................................................................... Bit Reversing .............................................................................................................. Data Management ...................................................................................................... Data Index Management ............................................................................................ Data Register Management ........................................................................................ ASM Code .................................................................................................................. COMPLEXITY AND PERFORMANCES .................................................................... PERFORMANCES ..................................................................................................... MEMORY USE ........................................................................................................... REFERENCES ........................................................................................................... Page 3 3 4 9 9 9 10 16 16 16 16 16 16 16 17 17 21 21 21 22
2/22
AN1381 - APPLICATION NOTE
1 - FFT PRINCIPLES 1.1 - Fast Fourier Transform The Fourier transform converts signals from the time domain into the frequency domain and vice versa. It is described by the following equation: Direct Fourier transform from time to frequency : (1) (f) =
x ( t )e -j2 ft dt
-
Inverse fourier transform from frequency to time: (2) ( t) =
x ( f ) e j2 ft df
-
To compute the Fourier transform using a fixed point DSP, the Discrete Fourier Transform which is the discrete version of the continuous Fourier transform is used. It is expressed as: (3) ( f) =
n = -
x (n T) e
- j2n Tf
for a signal which is sampled every T (the sampling period) seconds. As T is constant and assuming that the signal x(n) consists of N samples N-periodically repeated, we then have: (4) N-1 (k) =
x (n ) e
n=0
n - j2 k --N
that becomes, if the twiddle factor Wn is defined as: - j2 --N N-1 ( k) =
W
N
=e
(5)
x ( n )WN
n=0
nk
3/22
AN1381 - APPLICATION NOTE
One can notice the symmetry and periodicity properties of the twiddle factors as k N k N N k + --2 N k+N N
W
= -W
W
=W
1.2 - Radix-4 FFT Algorithm Direct computation would take N complex multiplications and N(N-1) complex additions which would lead to a complexity of O(N). To reduce this complexity, a new algorithm derived from the Cooley and Turkey [3] algorithm was created. This new algorithm, assuming N is a power of 4, divides an N-point Discrete Fourier Transform (DFT) into four N/4-point DFTs, then into 16 N/16-point DFTs and so on, up to arrive to a 4-point Fourier Transform.This algorithm is known as the radix-4 Fast Fourier Transform (radix-4 FFT). The complexity is then reduced to: Table 1 : Number of Operations of radix-2 and radix-4 FFTs
N 16 64 256 1024 radix-2 additions 152 1032 5896 30728 radix-4 additions 148 976 5488 28336 radix-2 multiplications 24 264 1800 10248 radix-4 multiplications 20 208 1392 7856
This leads to the decomposition of equation (5) into N --- - 1 4 (k) = N --- - 1 2 nk + N 3 --- - 1 4 nk +
N-1 x ( n )W nk N +
x ( n )WN
n=0
x ( n )WN
N --4
N --2
x ( n )WN
N 3 --4
nk
N --- - 1 4 =
0
N N N --- k 3 --- k --- k nk 4 2 4 N N N n + 3 --- W n + --- W n + --- W +x +x W x(n) + x N 4 N 2 N 4 N
as W N --- k 4 N N --- k 2 N = (-j ) k
W
= ( -1)
k
W
4/22
3N ------- k 4 N
= (j)
k
AN1381 - APPLICATION NOTE
It leads to : (6) N --- - 1 4 (k) =
n=0
k k k nk N N N x ( n ) + ( - j ) x n + --- + ( - 1 ) x n + --- + ( j ) x n + 3 --- W N 4 2 4
To get a 4-point DFT decomposition, let's decompose again as: (7) N --- - 1 4 nk N N N (k) = x ( n ) + x n + --- + x n + --- + x n + 3 --- W N 4 2 4 n=0
N --- - 1 4 (k + 1) =
n=0 N --- - 1 4
n(k + 1) N N N x ( n ) - jx n + --- - x n + --- + jx n + 3 --- W N 4 2 4
( k + 2) =
n=0 N --- - 1 4
n(k + 2) N N N x ( n ) - x n + --- + x n + --- - x n + 3 --- W N 4 2 4
(k + 3) =
n=0
n(k + 3) N N N x ( n ) + jx n + --- - x n + --- - jx n + 3 --- W N 4 2 4
This basic calculation is called a butterfly and is repeated for all the 4 point bundles.The butterfly pattern is shown graphically in Figure 1. Figure 1 : 16-Point radix-4 DIF FFT Butterfly
x(n) x(4r) -> leg1of the butterfly
x(n+N/4)
WN WN
n x(4r+1) -> leg2 of the butterfly
2n x(4r+2) -> leg3 of the butterfly 3n
x(n+N/2) W N
x(n+3N/4)
x(4r+3) -> leg4 of the butterfly
5/22
AN1381 - APPLICATION NOTE
Figure 2 : 16-Point radix 4 DIF FFT
x(0) x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10) x(11) x(12) x(13) x(14) x(15)
0 0 0 0 0 1 2 3 0 2 4 6 0 3 6 9
winspan
0 0 0 0
X(0) X(4) X(8) X(12) X(1) X(5) X(9) X(13) X(2) X(6) X(10) X(14) X(3) X(7) X(11) X(15)
interval
0 0 0 0 0 0 0 0 0 0 0 0
Stage 1
Stage 2
The number at the end of each branch of the butterfly corresponds to the twiddle factor exponent(i.e. 6 stands for WN6). It appears that the radix-4 algorithm is organized in butterflies, which are organized in groups of butterflies themselves organized in stages. In the example of a 16-point radix-4, two stages are available, four groups of one butterfly in the first stage and one group of four butterflies in the second stage. Groups vary from N/4 to 1 and butterflies per group vary from 1 to N/4.
6/22
AN1381 - APPLICATION NOTE
Figure 3 : radix-4 DIF FFT ALGORITHM
START
Initialize Stage loop
Set up for next stage Group loop
Set up for next group Butterfly loop
Set up for next butterfly
Compute butterfly
More butterfly? YES NO More groups? YES NO More stages? YES NO
Digit- reversed outputs
END
Data used to perform a butterfly calculation are called b0, b1, b2 and b3 (for example x(0),x(4),x(8) and x(12) in our previous example). They are separated by a distance called wingspan like bo from b1, b1 from b2 and so on.
7/22
AN1381 - APPLICATION NOTE
For the first stage, Winspan = N/4 and is decreasing following the rule: wingspan = winspan>>2. Data from two consecutive butterflies of the same groups are separated by a distance called interval.For the first stage interval = N as there is only one butterfly per group and at each stage, interval is decreasing following the same rule as winspan. The relationship between wingspan and interval is interval = 4*winspan. The algorithm can be described with flow chart in figure 3. The characteristics of a N-point radix 4 FFT are summarized in table 2: Table 2 : Characteristics of a N-Point radix-4
Stage Butterfly pergroup Butterfly group Twiddle Factor Exp. leg1 leg2 leg3 leg4 1 1 N/4 0 n 2n 3n n=0 to N/4-1 2 4 N/16 0 4n 8n 12n n=0 to N/16-1 3 16 N/64 0 16n 32n 48n n=0 to N/64-1 ... ... ... ... ... ... ... log4(N) N/4 1 0 (N/4)n (N/2)n (3N/4)n n=0
A way of saving some memory to perform the FFT is to do" in place " calculation (.i.e. the input data for a stage M+1 are the outputs of stage M). That way N*2 memory locations (one for real complex input and one for imaginary complex input) are saved. Then, due to the split at each stage of the 4kth, 4k+1 th, 4k+2th, 4k+3 th outputs of the previous stage, final outputs are ordered in a digit-reversed mode.It means that a normal ordered input array will produce a digit-reversed output array. An example of a digit-reversed array is given in the following table: Table 3 : Digit-Reverse Array
NORMAL ORDER Decimal 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 8/22 Binary 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 Quaternary 00 01 02 03 10 11 12 13 20 21 22 23 30 31 32 33 DIGIT-REVERSED ORDER Quaternary 00 10 20 30 01 11 21 31 02 12 22 32 03 13 23 33 Binary 0000 0100 1000 1100 0001 0101 1001 1101 0010 0110 1010 1110 0011 0111 1011 1111 Decimal 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
AN1381 - APPLICATION NOTE
2 - RADIX-4 FFT IMPLEMENTATION ON ST120 This section describes how to implement the radix-4 FFT algorithm on the ST120 and how to achieve a high performance program with low memory occupation thanks to the ST120 architecture. 2.1 - Data Organization To take advantage of the ST120 memory architecture (i.e. to be able to perform two data memory accesses in the same cycle), input data are splitted on two different memory banks. If NINPUTS is the number of points on which the FFT is performed then r[NINPUTS] is the real part array (in order) and i[NINPUTS]is the imaginary parts (in order) array. Real and imaginary parts of the twiddles are in the same array of 2*NINPUTS length, interleaved as real/ imaginary/real/imaginary... and so on. 2.2 - Avoiding Digit Reversing It can be shown that a radix-4 FFT with in order inputs and in order outputs can be performed using bit-reversing operations only. For that, at each stage of the FFT's calculation, the outputs of each butterfly have to be bit-reversed, i.e., if the outputs of a butterfly are numbered 0,1,2 and 3 the outputs 1 and 2 have to be swapped. For a 16-point FFT Figure 4 illustrates bit reversing operation. Figure 4 : 16-point FFT bit reversing
Bit reversing on each butterfly x(0) x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10) x(11) x(12) x(13) x(14) x(15) Stage 1 b0-0 b1-0 b2-0 b3-0 b0-1 b1-1 b2-1 b3-1 b0-2 b1-2 b2-2 b3-2 b0-3 b1-3 b2-3 b3-3 b0-0 b1-0 b2-0 b3-0 b0-2 b1-2 b2-2 b3-2 b0-1 b1-1 b2-1 b3-1 b0-3 b1-3 b2-3 b3-3 Stage 2 9/22 b0-0 b0-1 b0-2 b0-3 b1-0 b1-1 b1-2 b1-3 b2-0 b2-1 b2-2 b2-3 b3-0 b3-1 b3-2 b3-3 Bit reversing on each butterfly b0-0 b0-2 b0-1 b0-3 b1-0 b1-2 b1-1 b1-3 b2-0 b2-2 b2-1 b2-3 b3-0 b3-2 b3-1 b3-3 x(0) x(8) x(4) x(12) x(2) x(10) x(6) x(14) x(1) x(9) x(5) x(13) x(3) x(11) x(7) x(15)
AN1381 - APPLICATION NOTE
After the last stage data are then in bit-reversed order. A bit-reversing operation on the whole array is then needed to have in order outputs. As the ST120 has a bit-reverse addressing mode, this can be done very efficiently in the last stage of the calculation. 2.3 - C Code Implementation The following fixed point C code takes into consideration the features of the ST120 architecture to have a similar structure to the optimized assembly code. As data and twiddles are 16 bits variables they have been declared in "short " type.
void r4_full(short r[], short i[], short n, short w[]) { int stage, group, butterfly; int Stages, Groups, BpG; int WingSpan; int Wi, WStep; int Ri, Ii; short c0_1,s0_1, c0_2, s0_2, c0_3, s0_3; short c1_1,s1_1, c1_2, s1_2, c1_3, s1_3; short g0_0r, g0_0i, g0_1r, g0_1i, g0_2r, g0_2i, g0_3r, g0_3i; short g1_0r, g1_0i, g1_1r, g1_1i, g1_2r, g1_2i, g1_3r, g1_3i; short t0_0r, t0_0i, t0_1r, t0_1i, t0_2r, t0_2i, t0_3r, t0_3i; short t1_0r, t1_0i, t1_1r, t1_1i, t1_2r, t1_2i, t1_3r, t1_3i; short tt0_0r, tt0_0i, tt0_1r, tt0_1i, tt0_2r, tt0_2i, tt0_3r, tt0_3i; short tt1_0r, tt1_0i, tt1_1r, tt1_1i, tt1_2r, tt1_2i, tt1_3r, tt1_3i; Stages = 4;
10/22
AN1381 - APPLICATION NOTE
Calculation for a butterfly gives:
(*) for(butterfly=0; butterfly < BpG; butterfly++) {
g0_0i g1_0i g0_0r g1_0r g0_1i g1_1i g0_1r g1_1r g0_2i g1_2i g0_2r g1_2r g0_3i g1_3i g0_3r g1_3r t0_0i t1_0i t0_0r t1_0r t0_1i t1_1i t0_1r t1_1r t0_2i t1_2i t0_2r t1_2r t0_3i t1_3i t0_3r t1_3r
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
i[Ii]; i[Ii+1]; r[Ri]; r[Ri+1]; i[Ii+WingSpan]; i[Ii+WingSpan+1]; r[Ri+WingSpan]; r[Ri+WingSpan+1]; i[Ii+2*WingSpan]; i[Ii+2*WingSpan+1]; r[Ri+2*WingSpan]; r[Ri+2*WingSpan+1]; i[Ii+3*WingSpan]; i[Ii+3*WingSpan+1]; r[Ri+3*WingSpan]; r[Ri+3*WingSpan+1]; (short)(g0_0i (short)(g1_0i (short)(g0_0r (short)(g1_0r (short)(g0_0i (short)(g1_0i (short)(g0_0r (short)(g1_0r (short)(g0_1i (short)(g1_1i (short)(g0_1r (short)(g1_1r (short)(g0_1i (short)(g1_1i (short)(g0_1r (short)(g1_1r = = = = = = = = (short)(t0_0r (short)(t0_0i (short)(t1_0r (short)(t1_0i (short)(t0_0r (short)(t0_0i (short)(t1_0r (short)(t1_0i + + + + + + + + + + + + g0_2i); g1_2i); g0_2r); g1_2r); g0_2i); g1_2i); g0_2r); g1_2r); g0_3i); g1_3i); g0_3r); g1_3r); g0_3i); g1_3i); g0_3r); g1_3r); t0_2r); t0_2i); t1_2r); t1_2i); t0_2r); t0_2i); t1_2r); t1_2i);
// // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // //
Img(A0) Img(A1) Real(A0) Real(A1) Img(B0) Img(B1) Real(B0) Real(B1) Img(C0) Img(C1) Real(C0) Real(C1) Img(D0) Img(D1) Real(D0) Real(D1) Img(A0 + C0) Img(A1 + C1) Real(A0 + C0) Real(A1 + C1) Img(A0 - C0) Img(A1 - C1) Real(A0 - C0) Real(A1 - C1) Img(B0 + D0) Img(B1 + D1) Real(B0 + D0) Real(B1 + D1) Img(B0 - D0) Img(B1 - D1) Real(B0 - D0) Real(B1 - D1) = = = = = = = = (short)(t0_1r (short)(t0_1i (short)(t1_1r (short)(t1_1i (short)(t0_1r (short)(t0_1i (short)(t1_1r (short)(t1_1i + + + + t0_3i); t0_3r); t1_3i); t1_3r); t0_3i); t0_3r); t1_3i); t1_3r);
tt0_0r tt0_0i tt1_0r tt1_0i tt0_2r tt0_2i tt1_2r tt1_2i
tt0_1r tt0_1i tt1_1r tt1_1i tt0_3r tt0_3i tt1_3r tt1_3i
11/22
AN1381 - APPLICATION NOTE
Calculation for a butterfly gives (continued)
// 3 Complex Multiplies per butterfly // do a bitrev on the outputs. i[Ii] = tt0_0i; i[Ii+1] = tt1_0i; r[Ri] = tt0_0r; r[Ri+1] = tt1_0r; Ri += WingSpan; Ii += WingSpan; i[Ii] = (short)((c0_2*tt0_2i + s0_2*tt0_2r)>>15); i[Ii+1] = (short)((c1_2*tt1_2i + s1_2*tt1_2r)>>15); r[Ri] = (short)((c0_2*tt0_2r - s0_2*tt0_2i)>>15); r[Ri+1] = (short)((c1_2*tt1_2r - s1_2*tt1_2i)>>15); Ri += WingSpan; Ii += WingSpan; i[Ii] = (short)((c0_1*tt0_1i + s0_1*tt0_1r)>>15); i[Ii+1] = (short)((c1_1*tt1_1i + s1_1*tt1_1r)>>15); r[Ri] = (short)((c0_1*tt0_1r - s0_1*tt0_1i)>>15); r[Ri+1] = (short)((c1_1*tt1_1r - s1_1*tt1_1i)>>15); Ri += WingSpan; Ii += WingSpan; i[Ii] = (short)((c0_3*tt0_3i + s0_3*tt0_3r)>>15); i[Ii+1] = (short)((c1_3*tt1_3i + s1_3*tt1_3r)>>15); r[Ri] = (short)((c0_3*tt0_3r - s0_3*tt0_3i)>>15); r[Ri+1] = (short)((c1_3*tt1_3r - s1_3*tt1_3i)>>15); Ri += WingSpan; Ii += WingSpan; }
// // // // // // // // // // // // // // // //
Img(A0) Img(A1) Real(A0) Real(A1) Img(B0) Img(B1) Real(B0) Real(B1) Img(C0) Img(C1) Real(C0) Real(C1) Img(D0) Img(D1) Real(D0) Real(D1)
The first optimization is to use the packed arithmetic feature of the ST120. As the data for two consecutive butterflies in the same groups are consecutive in the data array, two half word data can be loaded at the same time in a word and two butterflies can be calculated at the same time with packed arithmetic. Therefore, the number of iteration of the group loop can be divided by 2. Some other simplifications in the C code can be derived from the structure of the radix-4 algorithm. First we can see from table-2 (Characteristics of an N-point FFT), that the different stages have different numbers of iteration for the group and butterfly loops.
12/22
AN1381 - APPLICATION NOTE
For the first stage, there is only one butterfly loop per group loop iteration so that it can be skipped.Then there is only one inner loop (the group's one).
Groups = n; BpG = 1; Wi = 0; Groups = Groups >> 2; WingSpan = Groups; Ri = 0; Ii = 0;
// ButterflysPerGroup
for(group=0; group < (Groups>>1); group++) { c0_1 = w[2*Wi]; s0_1 = w[2*Wi+1]; c1_1 = w[2*(Wi+1)]; s1_1 = w[2*(Wi+1)+1]; c0_2 s0_2 c1_2 s1_2 c0_3 s0_3 c1_3 s1_3 Wi + = = = = = = = = = w[2*2*Wi]; // loading the w[2*2*Wi+1]; // the twiddles w[2*2*(Wi+1)]; w[2*2*(Wi+1)+1]; w[3*2*Wi]; w[3*2*Wi+1]; w[3*2*(Wi+1)]; w[3*2*(Wi+1)+1]; 2;
BUTTERFLY CALCULTATION (*) } //for(group). 1st stage loop
13/22
AN1381 - APPLICATION NOTE
For the last stage, two simplifications can be performed. First, as for the first stage, one inner loop can be skipped as there is only one iteration per group loop for the whole stage. Secondly, if we look at the twiddle values for this stage, we can see that it is always (1,-1) so the multiplication by the twiddles are in fact only additions and subtractions.
Ri = 0; Ii = 0; BpG = BpG >> 1; for(butterfly=0; butterfly < BpG; g0_0i = i[Ii]; g0_1i = i[Ii+1]; g0_0r = r[Ri]; g0_1r = r[Ri+1]; g0_2i g0_3i g0_2r g0_3r g1_0i g1_1i g1_0r g1_1r g1_2i g1_3i g1_2r g1_3r t0_0i t1_0i t0_0r t1_0r t0_1i t1_1i t0_1r t1_1r = = = = = = = = = = = = = = = = = = = = i[Ii+2]; i[Ii+3]; r[Ri+2]; r[Ri+3]; i[Ii+4]; i[Ii+4+1]; r[Ri+4]; r[Ri+4+1]; i[Ii+4+2]; i[Ii+4+3]; r[Ri+4+2]; r[Ri+4+3]; (short)(g0_0i (short)(g1_0i (short)(g0_0r (short)(g1_0r (short)(g0_0i (short)(g1_0i (short)(g0_0r (short)(g1_0r + + + + g0_2i); g1_2i); g0_2r); g1_2r); g0_2i); g1_2i); g0_2r); g1_2r);
butterfly++) { // // // // // // // // // // // // // // // // // // // // // // // //
Img(A0) Img(B0) Real(A0) Real(B0) Img(C0) Img(D0) Real(C0) Real(D0) Img(A1) Img(B1) Real(A1) Real(B1) Img(C1) Img(D1) Real(C1) Real(D1) Img(A0 + C0) Img(A1 + C1) Real(A0 + C0) Real(A1 + C1) Img(A0 - C0) Img(A1 - C1) Real(A0 - C0) Real(A1 - C1)
14/22
AN1381 - APPLICATION NOTE
t0_2i t1_2i t0_2r t1_2r t0_3i t1_3i t0_3r t1_3r tt0_0r tt0_0i tt1_0r tt1_0i tt0_2r tt0_2i tt1_2r tt1_2i
= = = = = = = =
(short)(g0_1i (short)(g1_1i (short)(g0_1r (short)(g1_1r (short)(g0_1i (short)(g1_1i (short)(g0_1r (short)(g1_1r
+ + + + -
g0_3i); g1_3i); g0_3r); g1_3r); g0_3i); g1_3i); g0_3r); g1_3r); + + + + t0_2r); t0_2i); t1_2r); t1_2i); t0_2r); t0_2i); t1_2r); t1_2i);
// // // // // // // //
Img(B0 + D0) Img(B1 + D1) Real(B0 + D0) Real(B1 + D1) Img(B0 - D0) Img(B1 - D1) Real(B0 - D0) Real(B1 - D1) = = = = = = = = (short)(t0_1r (short)(t0_1i (short)(t1_1r (short)(t1_1i (short)(t0_1r (short)(t0_1i (short)(t1_1r (short)(t1_1i + + + + t0_3i); t0_3r); t1_3i); t1_3r); t0_3i); t0_3r); t1_3i); t1_3r);
= = = = = = = =
(short)(t0_0r (short)(t0_0i (short)(t1_0r (short)(t1_0i (short)(t0_0r (short)(t0_0i (short)(t1_0r (short)(t1_0i
tt0_1r tt0_1i tt1_1r tt1_1i tt0_3r tt0_3i tt1_3r tt1_3i
// do a i[Ii] i[Ii+1] r[Ri] r[Ri+1] Ii +=2; i[Ii] i[Ii+1] r[Ri] r[Ri+1] Ii +=2; i[Ii] i[Ii+1] r[Ri] r[Ri+1] Ii +=2; i[Ii] i[Ii+1] r[Ri] r[Ri+1] Ii +=2; }
bitrev on the outputs. = tt0_0i; = tt0_2i; = tt0_0r; = tt0_2r; Ri+=2; = tt0_1i; = tt0_3i; = tt0_1r; = tt0_3r; Ri+=2; = tt1_0i; = tt1_2i; = tt1_0r; = tt1_2r; Ri+=2; = tt1_1i; = tt1_3i; = tt1_1r; = tt1_3r; Ri+=2;
// // // // // // // // // // // // // // // //
Img(A0) Img(B0) Real(A0) Real(B0) Img(C0) Img(D0) Real(C0) Real(D0) Img(A1) Img(B1) Real(A1) Real(B1) Img(C1) Img(D1) Real(C1) Real(D1)
15/22
AN1381 - APPLICATION NOTE
2.4 - Assembly Implementation The C implementation already contains some optimizations to exploit the ST120 architecture, like separating the different stages of the calculation and processing two butterflies at the same time. 2.4.1 - Packed Arithmetic This last optimization was done to be able to use the packed arithmetic feature of the ST120. The ST120 can perform 2 Data Unit (DU) and 2 Address Unit (AU) operations on 32-bit words in the same cycle. With packed arithmetic, it can do up to 4 DU operations and 4 AU operations on 16-bit half-words in the same cycle if data are consecutive in memory. As previously described, data of 2 butterflies belonging to consecutive groups are consecutive in memory and can be then loaded, added or subtracted in the same packed instruction. Radix-4 butterfly takes 22 real additions and 12 real multiplications. Two butterflies calculations represent (22+12)*2 = 68 DU operations and thanks to the Multiply Accumulate(MAC) and Multiply Subtract (MSUB) operations and to packed arithmetic they can be performed in 20 cycles [2]. Similarly all the load operations are done in 32-bit format and since the MAC's result is a 32-bit data (16 msb needed only) store operations are done in 16-bit format. 2.4.2 - Software Pipelining Software pipelining is a code optimization technique where instruction level parallelism is exploited by overlapping the execution of successive iterations from an inner loop. It means that a prologue of the loop code is executed before the start of the loop and an epilog of the loop code is executed after the end of the loop.This allows to improve the combination of AU and DU instructions and to reduce the number of iteration by 1. No pure software pipelining is performed in the assembly code. Two packed loads are performed before the start of the loop to allow earlier start of the calculation in the first "sliw" group [1] of the asm code. Those two packed loads are then performed again near the end of the most inner loop (sliw group 15). 2.4.3 - Reload Counters Reload counters allow to reload a loop counter value without any waiting penalty (zero overhead) when two loops are overlapped. In our case, it is used in the intermediate stages to reload the butterfly and group loop counter value. It is also used in the first stage to precalculate the stage loop counter value for intermediate stages and in all the intermediate stages to precalculate the stage loop counter value for the last stage 2.4.4 - Bit Reversing Assembly version differs from the C in the fact that in assembly, the bit-reverse operation is done at the same time as the last stage calculation, thanks to the bit-reverse addressing mode of the ST120. Until the last stage, calculations are done" in place "i.e. data are loaded and stored in the same array. In the last stage, data are stored in order in another array just after they are calculated 2.4.5 - Data Management 2.4.5.1 - Data Index Management Instructions in the loop group are performed to calculate the data pointer address used in the butterfly loop and to reload twiddles values. Thanks to auto increment pointer of the ST120, the calculation of the data pointer address can be simplified. For the radix-4 FFT the following relationship is always true: 4*wingspan*butterflies-per-group = NINPUTS. The butterfly loop is performed butterflies-per-group times and the interval between two consecutive butterflies is 4*wingspan. Therefore, to calculate the next group data pointer address, only the subtraction of a constant is needed. If two groups are calculated at the same time, the subtraction of NINPUTS-2 is needed. NxtDataIndex = LastDataIndex - (NINPUTS-1(or2))
16/22
AN1381 - APPLICATION NOTE
2.4.5.2 - Data Register Management Data register use is quite intensive in the 'sliw' groups of the butterflies calculation. As two butterflies are calculated at the same time, it multiplies by two the number of operands and thus the number of data registers needed. An efficient way to save some registers is to put operand and destination data in the same register.Writing after each 'sliw' group which registers are used and which registers are free allows the programer to optimize the use of all the available data registers. 2.4.6 - ASM Code Asm code example for two butterflies calculation is:
ldp R0, @(P0 !+ P2) // R0H = Real(A1), R0L = Real(A0) ldp R2, @(P6 !+ P2) // R2H = Imag(A1), R2L = Imag(A0) nop nop // Rused = 2; Used: R0,R2 ; Avail: R1,R3-R15 _fft_group_loop_st: _fft_bfly_loop_st: // add network 1st level ldp R4, @(P0 + P2) // R4H = Real(C1), R4L = Real(C0) ldp R6, @(P6 + P2) // R6H = Imag(C1), R6L = Imag(C0) addcp R0, R0, R4 // t1_0r=R(A1+C1), t0_0r=R(A0+C0) subcp R4, R0, R4 // t1_1r=R(A1-C1), t0_1r=R(A0-C0) // Rused = 4; Used: R0,R2,R4,R6 ; Avail: R1,R3,R5,R7-R15 // add network 1st level ldp R1, @(P0 !+ P10) // R1H = Real(B1), ldp R3, @(P6 !+ P10) // R3H = Imag(B1), addcp R2, R2, R6 // t1_0i=I(A1+C1), subcp R6, R2, R6 // t1_1i=I(A1-C1), // Rused = 6; Used: R0,R1,R2,R3,R4,R6 ; Avail: // add network 1st level ldp R5, @(P0 !- P10) ldp R7, @(P6 !- P10) addcp R1, R1, R5 subcp R5, R1, R5 // Rused = 8; Used: R0-R7 ; // add network 2nd level ldw R8, @(P1 - P4) addcp R0, R0, R1 subcp R1, R0, R1 sdp @(P0 - P2), R0 // Rused = 8; Used: R1-R8 ;
R1L = Real(B0) R3L = Imag(B0) t0_0i=I(A0+C0) t0_1i=I(A0-C0) R5,R7-R15
// R5H = Real(D1), // R7H = Imag(D1), // t1_2r=R(B1+D1), // t1_3r=R(B1-D1), Avail: R8-R15
R5L = Real(D0) R7L = Imag(D0) t0_2r=R(B0+D0) t0_3r=R(B0-D0)
// W0_1=R8 H,L: s0_1, c0_1 // tt1_0r=(t1_0r+t1_2r),tt0_0r=(t0_0r+t0_2r) // tt1_2r=(t1_0r-t1_2r),tt0_2r=(t0_0r-t0_2r) // tt0_0r, tt1_0r Avail: R0,R9-R15
17/22
AN1381 - APPLICATION NOTE
Asm code (continued)
// add network 1st level ldw R10, @(P1 + P4) // W0_3=R10 H,L: s0_3, c0_3 addcp R3, R3, R7 // t1_2i=I(B1+D1), t0_2i=I(B0+D0) subcp R7, R3, R7 // t1_3i=I(B1-D1), t0_3i=I(B0-D0) addba P4, P4, P3 // Rused = 8; Used: R1,R3-R8,R10 ; Avail: R0,R2,R9, R11-R15 // add network 2nd level ldw R9, @(P1 !+ P5) // W0_2=R9 H,L: s0_2, c0_2 addcp R2, R2, R3 // tt1_0i=(t1_0i+t1_2i),tt0_0i=(t0_0i+t0_2i) subcp R3, R2, R3 // tt1_2i=(t1_0i-t1_2i),tt0_2i=(t0_0i-t0_2i) sdp @(P6 - P2), R2 // tt0_0i, tt1_0i // Rused = 9; Used: R1,R3-R10 ; Avail: R0,R2,R11-R15 // add network 2nd level nop ldw R11, @(P1 - P4) // W1_1=R11 H,L: s1_1, c1_1 addcp R4, R4, R7 // tt1_1r=(t1_1r+t1_3i),tt0_1r=(t0_1r+t0_3i) subcp R7, R4, R7 // tt1_3r=(t1_1r-t1_3i),tt0_3r=(t0_1r-t0_3i) // Rused = 10; Used: R1,R3-R11 ; Avail: R0,R2,R12-R15 // add network 2nd level ldw R13, @(P1 + P4) // W1_3=R13 H,L: s1_3, c1_3 addcp R5, R6, R5 // tt1_3i=(t1_1i+t1_3r),tt0_3i=(t0_1i+t0_3r) subcp R6, R6, R5 // tt1_1i=(t1_1i-t1_3r),tt0_1i=(t0_1i-t0_3r) subba P4, P4, P3 // Rused = 11; Used: R1,R3-R11,R13 ; Avail: R0,R2,R12,R14-R15 // Multiplies // group 1, index 1 ldw R12, @(P1 !- P5) // W1_2=R12 H,L: s1_2, c1_2 nop mpfchl R0, R1, R12 // Res1_2r = tt1_2r * c1_2 mpfhh R2, R1, R12 // Res1_2i = tt1_2r * s1_2 // Rused = 14; Used: R0-R13 ; Avail: R14-R15
18/22
AN1381 - APPLICATION NOTE
Asm code (continued)
msfchh R0, R0, R3, R12 // Res1_2r mafchl R2, R2, R3, R12 // Res1_2i sdf @(P0 + 2), R0 // Res1_2r sdf @(P6 + 2), R2 // Res1_2i // Rused = 11; Used: R1,R3-R11,R13 ;
= Res1_2r - tt1_2i * s1_2 = Res1_2i + tt1_2i * c1_2 ,Store to index 1 for bitrev Avail: R0,R2,R12,R14-R15
// group 0, index 1 nop nop mpfcll R0, R1, R9 // Res0_2r = tt0_2r * c0_2 mpflh R2, R1, R9 // Res0_2i = tt0_2r * s0_2 // Rused = 13; Used: R0-R11,R13 ; Avail: R12,R14-R15 msfclh R0, R0, R3, R9 // Res0_2r = Res0_2r mafcll R2, R2, R3, R9 // Res0_2i = Res0_2i sdf @(P0 !+ P2), R0 // Res0_2r, Store to sdf @(P6 !+ P2), R2 // Res0_2i // Rused = 8; Used: R4-R8,R10-R11,R13 ; Avail: // group 1, index 2 nop nop mpfchl R9, R4, R11 mpfhh R12, R4, R11 msfchh R9, R9, R6, R11 mafchl R12, R12, R6, R11 sdf @(P0 + 2), R9 sdf @(P6 + 2), R12 - tt0_2i * s0_2 + tt0_2i * c0_2 index 1 for bitrev R0-R3,R9,R12,R14-R15
// Res1_1r = tt1_1r * c1_1 // Res1_1i = tt1_1r * s1_1 // // // // Res1_1r = Res1_1r - tt1_1i * s1_1 Res1_1i = Res1_1i + tt1_1i * c1_1 Res1_1r, Store to index 2 for bitrev Res1_1i
// group 0, index 2 // Load Ar, Ai for next 2 groups ldp R0, @(P0 + P10) // R0H = Real(A1), R0L = Real(A0) ldp R2, @(P6 + P10) // R2H = Imag(A1), R2L = Imag(A0) mpfcll R9, R4, R8 // Res0_1r = tt0_1r * c0_1 mpflh R12, R4, R8 // Res0_1i = tt0_1r * s0_1
19/22
AN1381 - APPLICATION NOTE
Asm code (continued)
msfclh R9, R9, R6, R8 mafcll R12, R12, R6, R8 sdf @(P0 !+ P2), R9 sdf @(P6 !+ P2), R12 // group 1, index 3 nop nop mpfchl R9, R7, R13 mpfhh R12, R7, R13 msfchh R9, R9, R5, R13 mafchl R12, R12, R5, R13 sdf @(P0 + 2), R9 sdf @(P6 + 2), R12 // group 0, index 3 nop nop mpfcll R9, R7, R10 mpflh R12, R7, R10 msfclh R9, R9, R5, R10 mafcll R12, R12, R5, R10 sdf @(P0 !+ P10), R9 sdf @(P6 !+ P10), R12 _fft_bfly_loop_en: ldp R0, @(P0 - P14) // R0H = Real(A1), R0L = Real(A0) ldp R2, @(P6 - P14) // R2H = Imag(A1), R2L = Imag(A0) nop nop // Rused = 2; Used: R0,R2 ; Avail: R1,R3-R15 nop nop subha P0, P0, P8 subha P6, P6, P8 nop nop addha P4, P4, P3 addwa P1, P1, P7 _fft_group_loop_en: nop nop shra2 P2, P2 GP32md
20/22
// // // //
Res0_1r = Res0_1r - tt0_1i * s0_1 Res0_1i = Res0_1i + tt0_1i * c0_1 Res0_1r , Store to index 2 for bitrev Res0_1i
// Res1_3r = tt1_3r * c1_3 // Res1_3i = tt1_3r * s1_3 // // // // Res1_3r = Res1_3r - tt1_3i * s1_3 Res1_3i = Res1_3i + tt1_3i * c1_3 Res1_3r Res1_3i
// Res0_3r = tt0_3r * c0_3 // Res0_3i = tt0_3r * s0_3 // // // // Res0_3r = Res0_3r - tt0_3i * s0_3 Res0_3i = Res0_3i + tt0_3i * c0_3 Res0_3r Res0_3i
// WingSpan=Groups
AN1381 - APPLICATION NOTE
3 - COMPLEXITY AND PERFORMANCES 3.1 - Performances To evaluate the theoretical performances of a NINPUTS points FFT, first the complexity of each separate loop has to be evaluated and then the complexity of each different stage (first, intermediate, last). - Stage loop instruction (for intermediate stages): each instruction in this loop is performed [log4(NINPUTS)-2] times. - Group loop instructions: each instruction in this loop is performed (NINPUTS/4-4)/3 times. - Butterfly loop instructions: each instruction in this loop is performed [(log4(NINPUTS)-2)*NINPUTS/4] times. The complexity of each different stage of our assembly code is calculated as follows: - The complexity of the first stage: 20 cycles*1/2 * N/4 groups + constant - The complexity of the last stage: 7 cycles * N/4 butterflies + constant - The complexity of the intermediate stages: 20 cycles * 1/2 * [(log4(NINPUTS)-2)*NINPUTS/4] (butterfly loop instructions) + 3 cycles * (NINPUTS/4-4)/6 (group loop instructions) + 9 cycles * (log4(NINPUTS)-2) (stage loop instructions) + constant The complexity of the whole algorithm is the sum of the complexity of each stage: cycles count = 20*1/2*NINPUTS/4 + 3*(NINPUTS/4-4)/6 + 20*1/2 * [(log4(NINPUTS)-2)*NINPUTS/4] + 9*(log4(NINPUTS)-2) + 7*NINPUTS/4 + cst Cycle Count =35*NINPUTS/8+(5*NINPUTS/2+9)*(log4(NINPUTS)-2)-2 That is to say
256 complex points radix-4 FFT 1024 complex points radix-4 FFT 2416 + (constant=35) = 2451 cycles 12185 + constant
Cycle count for radix-4 The average number of 32-bit instructions performed at each cycle by the ST120 is 3.51. A ST120 EVA development ship has been used for measurement. 3.2 - Memory Use The code size for the radix-4 function is of 1132 bytes. For data memory, it has been explained before that to allow access decoupling on ST120, input data had to be separated on two different memory banks. Each of them occupies NINPUTS half-words so for 256 points input data take 2*256*2 = 1 Kbyte of memory. Twiddles array is composed of 2*NINPUTS half-words so it also occupies 1 Kbyte of memory. FFT calculation can be done "in place"( i.e. data is written at the same memory place where it has been read). After the last stage, data have to be bit-reversed to be in natural order. 2*NINPUTS additional memory locations will be required to store data in natural order. Table 4 shows the overall memory requirement. Table 4 : Overall Memory requirement
Program memory 1132 Bytes RAM 2048 Bytes ROM 1024 Bytes 21/22
AN1381 - APPLICATION NOTE
4 - REFERENCES [1] ST120 DSP-MCU Core Reference Guide available on www.st.com/st100/. [2] ST120 DSP-MCU Instruction Set Reference Guide available on www.st.com/st100/. [3] J.W. Cooley and J.W. Turkey "An algorithm for the machine calculation of complex Fourier series", Math.Comp., vol 19 April 1965 pp 297-301.
Information furnished is believed to be accurate and reliable. However, STMicroelectronics assumes no responsibility for the consequences of use of such information nor for any infringement of patents or other rights of third parties which may result from its use. No license is granted by implication or otherwise under any patent or patent rights of STMicroelectronics. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. STMicroelectronics products are not authorized for use as critical components in life support devices or systems without express written approval of STMicroelectronics. The ST logo is a registered trademark of STMicroelectronics (c) 2001 STMicroelectronics - All Rights Reserved STMicroelectronics GROUP OF COMPANIES Australia - Brazil - Canada - China - Finland - France - Germany - Hong Kong - India - Israel - Italy - Japan - Malaysia - Malta - Morocco Singapore - Spain - Sweden - Switzerland - United Kingdom - United States http://www.st.com
22/22

▲Up To Search▲

Price & Availability of AN1381

	To Download AN1381 Datasheet File
If you can't view the Datasheet, Please click here to try to view without PDF Reader .